Predicting Mental Health Problems from Social Determinants of Health and Caregiving Activities of Caregivers of Persons with Dementia: A Machine Learning Approach

BMIN503/EPID600 Final Project

Author

Hannah Cho


1 Overview

In the United States, as of 2021, approximately 34 million informal caregivers were providing unpaid support to persons living with dementia (PLWD). As dementia progresses to its end-stage or terminal phase, individuals with the condition often become fully dependent on caregivers for assistance with most daily activities, such as bathing, repositioning, and other essential personal care needs. Notably, 80% of PLWD receive care from informal caregivers—primarily spouses, adult children, and close friends—within community settings rather than institutionalized environments. This project addressed a public health issue central to dementia caregiving, namely caregivers' emotional problems such as depression and anxiety, by identifying the caregiving activities and sociodemographic features that predict which caregivers are at high risk.

This project was conducted after consultations with Dr. George Demiris, a recognized expert in the field of dementia care and caregiving, and Dr. Huang, an experienced biostatistician. Their contributions provided critical insights into both the substantive and methodological aspects of the research, ensuring a rigorous and well-informed approach to addressing the challenges faced by caregivers of PLWD.

2 Introduction

In the United States, as of 2021, 34 million informal caregivers were providing unpaid support to persons living with dementia (PLWD; Alzheimer’s Association, 2023; McCabe et al., 2016; Reinhard et al., 2023). As PLWD progress to the end-stage or terminal stage of dementia, they often become dependent on assistance for most daily activities, including bathing and changing position. Of all PLWD, 80% receive care from individuals—often spouses, adult children, and close friends—in the community (Alzheimer’s Association, 2023).

Dementia caregivers face profoundly complex and multifaceted challenges that extend beyond the physical and emotional demands of caregiving. These challenges encompass physical tasks, such as providing continuous care, assisting with activities of daily living, and managing the multifarious health complications associated with dementia. However, the toll of caregiving is not solely physical; it also places a considerable emotional burden on caregivers. Many experience elevated levels of stress, anxiety, depression, and isolation, as the exhaustive nature of caregiving leaves little opportunity for self-care or social engagement.

Caregivers for persons living with dementia (PLWD) are particularly vulnerable to anxiety and depression due to the progressive and unpredictable trajectory of the disease. This trajectory often varies significantly based on individuals’ pre-existing conditions and comorbidities, adding to caregivers’ uncertainty and emotional strain.

Beyond these personal and emotional burdens, caregivers frequently encounter substantial social and systemic barriers that hinder their ability to access necessary support. These barriers include financial strain, a lack of accessible and affordable respite care, limited awareness of available resources, and cultural or social stigmas surrounding caregiving. Furthermore, caregivers often experience isolation due to diminished social engagement, compounding the psychological toll.

This study aims to explore how specific caregiving activities, individual characteristics, and sociodemographic factors predict the likelihood of anxiety and depression in dementia caregivers. Additionally, it investigates how effectively machine learning models can forecast these mental health outcomes, offering a novel approach to identifying caregivers at greatest risk and informing targeted interventions.

Problem Statement: Dementia caregiving is a demanding role, often accompanied by significant psychological challenges, including anxiety and depression. However, the specific caregiving activities, caregiver characteristics, and sociodemographic factors that most strongly predict these mental health outcomes remain insufficiently understood. Furthermore, while traditional statistical approaches have provided valuable insights, the potential of machine learning models to accurately forecast anxiety and depression in dementia caregivers remains underexplored. This research aims to identify the key predictors of these mental health outcomes and evaluate the predictive accuracy of machine learning algorithms in supporting early identification and targeted interventions for at-risk caregivers.

Research Question: This research aims to identify the key predictors of mental health outcomes and evaluate the predictive accuracy of machine learning algorithms in supporting the early identification of at-risk caregivers.

3 Methods

Dataset: This study used data from the National Health and Aging Trends Study (NHATS) Round 11 and the National Study of Caregiving (NSOC) Round 4, which include data collected in 2021. The NHATS is a publicly accessible dataset that includes a nationally representative sample of adults aged 65 years and older who are Medicare beneficiaries in the United States of America. The NHATS began in 2011 and included 8,245 participants who have been followed up annually since then; the study goal is to encourage research to maximize health and enhance the quality of life of older adults. The NSOC is conducted alongside the NHATS; participants in the NSOC are caregivers for older adults included in the NHATS. Both the NHATS and the NSOC were funded by the National Institute on Aging (R01AG062477; U01AG032947). When used together, the NHATS and NSOC provide valuable information on dyads of older adults receiving care and their family caregivers.

Samples: Persons living with dementia (PLWD): Probable dementia was identified based on one of the following criteria: a self-reported diagnosis of dementia or Alzheimer’s disease by a physician; a score of 2 or higher on the AD8 screening instrument administered to proxy respondents; or a score at least 1.5 standard deviations below the mean on a range of cognitive tests. Caregivers: Caregivers were identified from the NSOC and NHATS datasets. Because this project specifically examines caregivers of persons with dementia living in the community, the sample was further filtered by dementia classification (demclas) and residency (r11dresid).
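The three routes to a probable-dementia classification described above can be summarized in a small sketch. This is an illustrative simplification of the NHATS programming statements (the full adaptation is worked through later in this section); `classify_dementia()` and its arguments are hypothetical names, not NHATS variables:

```r
# Illustrative sketch of the three classification routes (not the official NHATS code).
# impaired_domains = number of cognitive domains scoring <= -1.5 SD below the mean.
classify_dementia <- function(reported_diagnosis, ad8_score, impaired_domains) {
  if (isTRUE(reported_diagnosis)) return("probable")            # self/proxy-reported diagnosis
  if (!is.na(ad8_score) && ad8_score >= 2) return("probable")   # AD8 criterion (proxy respondents)
  if (impaired_domains >= 2) return("probable")                 # 2-3 impaired cognitive domains
  if (impaired_domains == 1) return("possible")                 # 1 impaired domain
  "no dementia"
}

classify_dementia(FALSE, 0, 1)  # "possible"
```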

After retrieving NHATS Round 11 and NSOC Round 4, I selected the sample (from NHATS R11, r11demclas) and then merged the necessary datasets.

  • As the purpose of this study was to examine family caregivers who provide care to persons living with dementia at home, we selected those who provided care to persons living with dementia and lived at home using the Dementia Classification with Programming Statements provided by the NHATS.

  • The final sample included 536 dyads.

3.1 Loading Packages and Importing the Datasets

#Loading necessary packages first.
library(haven) #dta file
library(dplyr) #data cleaning

library(ggplot2) #data visualization 
library(gtsummary) #summary statistics

#Bring datasets 
df1 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NHATS_Round_11_SP_File_V2.dta") # dementia classification in this file
df2 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NSOC_r11.dta") #caregiver information 1
df3 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NSOC_cross.dta") #caregiver information 2
df4 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NHATS_Round_11_OP_File.dta") #older adults information 

3.2 Choosing Probable and Possible Dementia Participants

#need to clean df1 first in order to classify dementia classes  
#ENTER WHICH ROUND?
sp1 <- df1 |>
  mutate(rnd = 11) 

#3. EDIT ROUND NUMBER INSIDE THE QUOTES 
#(THIS REMOVES THE PREFIXES ON NEEDED VARIABLES ) 
sp1 <- sp1 |>
  rename_all(~stringr::str_replace(.,"^r11","")) |>
  rename_all(~stringr::str_replace(.,"^hc11","")) |>
  rename_all(~stringr::str_replace(.,"^is11","")) |>
  rename_all(~stringr::str_replace(.,"^cp11","")) |> 
  rename_all(~stringr::str_replace(.,"^cg11",""))

#ADD R1DAD8DEM AND SET TO -1 FOR ROUND 1 BECAUSE THERE IS NO PRIOR DIAGNOSIS IN R1
sp1 <- sp1 |>
  mutate(dad8dem = ifelse(rnd == 1, -1, dad8dem))



#SUBSET NEEDED VARIABLES
df<-sp1 |> 
  dplyr::select(spid, rnd, dresid, resptype, disescn9, chgthink1, chgthink2, chgthink3, chgthink4, chgthink5, chgthink6, chgthink7, chgthink8, dad8dem,
                speaktosp, todaydat1, todaydat2, todaydat3, todaydat4, todaydat5, presidna1, presidna3, vpname1, vpname3, quesremem, dclkdraw, atdrwclck, 
                dwrdimmrc, dwrdlstnm, dwrddlyrc)

#FIX A ROUND 2 CODING ERROR#
df <- df |>
  mutate(dwrdimmrc = ifelse(dwrdimmrc==10 & dwrddlyrc==-3 & rnd==2, -3, dwrdimmrc))

#CREATE SELECTED ROUND DEMENTIA CLASSIFICATION VARIABLE 
df <- df |>
  mutate(demclas  =  ifelse(dresid==3 | dresid==5 | dresid==7, -9,                   #SET MISSING (RESIDENTIAL CARE FQ ONLY)
                            ifelse((dresid==4 & rnd==1) | dresid==6 | dresid==8, -1, #SET N.A. (NURSING HOME RESIDENTS, DECEASED)
                                   ifelse((disescn9==1 | disescn9==7) &              #CODE PROBABLE IF DEMENTIA DIAGNOSIS REPORTED BY SELF OR PROXY
                                            (resptype==1 | resptype==2), 1, NA))))

#CODE AD8_SCORE*
#INITIALIZE COUNTS TO NOT APPLICABLE*
#ASSIGN VALUES TO AD8 ITEMS IF PROXY AND DEMENTIA CLASS NOT ALREADY ASSIGNED BY REPORTED DIAGNOSIS 
for(i in 1:8){
  df[[paste("ad8_", i, sep = "")]]  <- as.numeric(ifelse(df[[paste("chgthink", i, sep = "")]]==2 & df$resptype==2 & is.na(df$demclas), 0, #PROXY REPORTS NO CHANGE
                                                         ifelse((df[[paste("chgthink", i, sep = "")]]==1 | df[[paste("chgthink", i, sep = "")]] == 3) & df$resptype==2 & is.na(df$demclas), 1, #PROXY REPORTS A CHANGE OR ALZ/DEMENTIA*
                                                                ifelse(df$resptype==2 & is.na(df$demclas), NA, -1))))    #SET TO NA IF IN RES CARE AND demclass=., OTHERWISE AD8 ITEM IS SET TO NOT APPLICABLE                                                                                                                        
}

#INITIALIZE COUNTS TO NOT APPLICABLE*
for(i in 1:8){
  df[[paste("ad8miss_", i, sep = "")]]  <- as.numeric(ifelse(is.na(df[[paste("ad8_", i, sep = "")]]), 1,
                                                             ifelse((df[[paste("ad8_", i, sep = "")]]==0 | df[[paste("ad8_", i, sep = "")]]==1) & df$resptype==2 & is.na(df$demclas), 0, -1)))
}

for(i in 1:8){
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(ifelse(is.na(df[[paste("ad8_", i, sep = "")]]) & is.na(df$demclas) & df$resptype==2, 0, df[[paste("ad8_", i, sep = "")]]))
}

#COUNT AD8 ITEMS
#ROUNDS 2+
df <- df |>
  mutate(ad8_score = ifelse(resptype==2 & is.na(demclas), (ad8_1 + ad8_2 + ad8_3 + ad8_4 + ad8_5 + ad8_6 + ad8_7 + ad8_8), -1)) %>% 
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 
  mutate(ad8_score = ifelse(dad8dem==1 & resptype==2 & is.na(demclas), 8, ad8_score))  %>% 
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 FOR ROUNDS 4-9
  mutate(ad8_score = ifelse(resptype==2 & dad8dem==-1 & chgthink1==-1 & (rnd>=4 & rnd<=9) & is.na(demclas) , 8, ad8_score)) 

#COUNT MISSING AD8 ITEMS
df <- df |> 
  mutate(ad8_miss = ifelse(resptype==2 & is.na(demclas),(ad8miss_1+ad8miss_2+ad8miss_3+ad8miss_4+ad8miss_5+ad8miss_6+ad8miss_7+ad8miss_8), -1))

#CODE AD8 DEMENTIA CLASS 
#IF SCORE>=2 THEN MEETS AD8 CRITERIA
#IF SCORE IS 0 OR 1 THEN DOES NOT MEET AD8 CRITERIA
df <- df |> 
  mutate(ad8_dem = ifelse(ad8_score>=2, 1,
                          ifelse(ad8_score==0 | ad8_score==1 | ad8_miss==8, 2, NA)))

#UPDATE DEMENTIA CLASSIFICATION VARIABLE WITH AD8 CLASS
df <- df |> 
  #PROBABLE DEMENTIA BASED ON AD8 SCORE  
  mutate(demclas = ifelse(ad8_dem==1 & is.na(demclas), 1, 
                          #NO DIAGNOSIS, DOES NOT MEET AD8 CRITERION, AND PROXY SAYS CANNOT ASK SP COGNITIVE ITEMS*
                          ifelse(ad8_dem==2 & speaktosp==2 & is.na(demclas), 3, demclas)))


####CODE DATE ITEMS AND COUNT 
#CODE ONLY YES/NO RESPONSES: MISSING/NA CODES -1, -9 LEFT MISSING*
#2: NO/DK OR -7: REFUSED RECODED TO : NO/DK/RF*
#NOTE: IN ROUND 4 THE FOURTH DATE ITEM WAS FIELDED AS TODAYDAT5, SO DATE_ITEM5 SUBSTITUTES FOR DATE_ITEM4 BELOW
for(i in 1:5){
  df[[paste("date_item", i, sep = "")]]  <- as.numeric(ifelse(df[[paste("todaydat", i, sep = "")]]==1, 1,
                                                              ifelse(df[[paste("todaydat", i, sep = "")]]==2 | df[[paste("todaydat", i, sep = "")]]== -7, 0, NA)))
}

#COUNT CORRECT DATE ITEMS
df <- df |>
  mutate(date_item4 = ifelse(rnd==4, date_item5, date_item4)) %>% 
  mutate(date_sum = date_item1 + date_item2 + date_item3 + date_item4) %>% 
  
  #PROXY SAYS CAN'T SPEAK TO SP
  mutate(date_sum = ifelse(speaktosp==2 & is.na(date_sum),-2,  
                           #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER*
                           ifelse((is.na(date_item1) | is.na(date_item2) | is.na(date_item3) | is.na(date_item4)) & speaktosp==1,-3, date_sum))) %>% 
  
  #MISSING IF PROXY SAYS CAN'T SPEAK TO SP*  
  mutate(date_sumr = ifelse(date_sum == -2 , NA, 
                            #0 IF SP UNABLE TO ANSWER*
                            ifelse(date_sum == -3 , 0, date_sum)))


########PRESIDENT AND VICE PRESIDENT NAME ITEMS AND COUNT########## 
##CODE ONLY YES/NO RESPONSES: MISSING/N.A. CODES -1,-9 LEFT MISSING *
##2:NO/DK OR -7:REFUSED RECODED TO 0:NO/DK/RF*
df <- df |>
  mutate(preslast = ifelse(presidna1 == 1, 1,
                           ifelse(presidna1 == 2 | presidna1 == -7, 0, NA))) |> 
  mutate(presfirst = ifelse(presidna3 == 1, 1,
                            ifelse(presidna3 == 2 | presidna3 == -7, 0, NA))) |> 
  mutate(vplast = ifelse(vpname1 == 1, 1,
                         ifelse(vpname1 == 2 | vpname1 == -7, 0, NA))) |> 
  mutate(vpfirst = ifelse(vpname3 == 1, 1,
                          ifelse(vpname3 == 2 | vpname3 == -7, 0, NA))) |> 
  
  #COUNT CORRECT PRESIDENT/VP NAME ITEMS*
  mutate(presvp = preslast + presfirst + vplast + vpfirst) |> 
  #PROXY SAYS CAN'T SPEAK TO SP 
  mutate(presvp = ifelse(speaktosp == 2 & is.na(presvp), -2, 
                         #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER                           
                         ifelse((is.na(preslast) | is.na(presfirst) | is.na(vplast) | is.na(vpfirst)) & speaktosp==1 & is.na(presvp),-3, presvp))) |> 
  
  #MISSING IF PROXY SAYS CAN’T SPEAK TO SP*
  mutate(presvpr =  ifelse(presvp == -2 , NA, 
                           ifelse(presvp == -3 , 0, presvp))) |> 
  
  #ORIENTATION DOMAIN: SUM OF DATE RECALL AND PRESIDENT/VP NAMING* 
  mutate(date_prvp = date_sumr + presvpr)


#######EXECUTIVE FUNCTION DOMAIN: CLOCK DRAWING SCORE##########
#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUND 10 ONLY)* 
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd==10, -2,
                           ifelse(speaktosp==1 & (quesremem==2 | quesremem==-7 | quesremem==-8) & dclkdraw==-9 & rnd==10, -3,
                                  ifelse(atdrwclck==2 & dclkdraw==-9 & rnd==10, -4,
                                         ifelse(atdrwclck==97 & dclkdraw==-9 & rnd==10, -7, dclkdraw)))))

#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUNDS 11 AND FORWARD ONLY)* 
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd >= 11, -2,
                           ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) & dclkdraw == -9 & rnd >= 11, -3, dclkdraw)))
df<-df  |>
  mutate(clock_scorer = ifelse(dclkdraw == -3 | dclkdraw == -4 | dclkdraw == -7, 0,
                               #IMPUTE MEAN SCORE TO PERSONS MISSING A CLOCK*
                               #IF PROXY SAID CAN ASK SP*
                               ifelse(dclkdraw == -9 & speaktosp == 1, 2, 
                                      #IF SELF-RESPONDENT*       
                                      ifelse(dclkdraw == -9 & speaktosp == -1, 3, 
                                             ifelse(dclkdraw == -2 | dclkdraw == -9, NA, dclkdraw)))))


#MEMORY DOMAIN: IMMEDIATE AND DELAYED WORD RECALL 
df <- df |>
  mutate(irecall  =  ifelse(dwrdimmrc == -2 | dwrdimmrc == -1, NA,
                            ifelse(dwrdimmrc == -7 | dwrdimmrc == -3, 0, dwrdimmrc))) |> 
  mutate(irecall = ifelse(rnd==5 & dwrddlyrc==-9, NA, irecall)) |>  #round 5 only: set cases with missing word list and not previously assigned to missing
  
  mutate(drecall  =  ifelse(dwrddlyrc == -2 | dwrddlyrc == -1, NA,
                            ifelse(dwrddlyrc == -7 | dwrddlyrc == -3, 0, dwrddlyrc))) |> 
  mutate(drecall = ifelse(rnd==5 & dwrddlyrc==-9, NA, drecall)) |>  #round 5 only: set cases with missing word list and not previously assigned to missing
  
  mutate(wordrecall0_20 = irecall+drecall)


#CREATE COGNITIVE DOMAINS FOR ALL ELIGIBLE 

df<-df |> 
  mutate(clock65 = ifelse(clock_scorer == 0 | clock_scorer==1, 1, 
                          ifelse(clock_scorer > 1 & clock_scorer<6, 0, NA)))

df<-df |>  
  mutate(word65 = ifelse(wordrecall0_20 >= 0 & wordrecall0_20 <=3, 1, 
                         ifelse(wordrecall0_20 > 3 & wordrecall0_20 <=20, 0, NA)))

df<-df |>  
  mutate(datena65 = ifelse(date_prvp >= 0 & date_prvp <=3, 1, 
                           ifelse(date_prvp > 3 & date_prvp <= 8, 0, NA)))

#  *CREATE COGNITIVE DOMAIN SCORE*
df<-df |> 
  mutate(domain65 = clock65+word65+datena65)

#*SET CASES WITH MISSING WORD LIST AND NOT PREVIOUSLY ASSIGNED TO MISSING (ROUND 5 ONLY)
df<-df |>   
  mutate(demclas = ifelse(rnd==5 & dwrdlstnm==-9 & is.na(demclas), -9, demclas))

#UPDATE COGNITIVE CLASSIFICATION*
df<-df |> 
  #PROBABLE DEMENTIA
  mutate(demclas = ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & (domain65==2 | domain65==3), 1,
                          #POSSIBLE DEMENTIA
                          ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65==1, 2,
                                 #NO DEMENTIA
                                 ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65==0, 3, demclas))))

#KEEP VARIABLES AND SAVE DATA
df<-df |> 
  dplyr::select(spid, rnd, demclas)

#CHANGE # AFTER "r" TO THE ROUND OF INTEREST
r11demclas <- df

#4. NAME AND SAVE DEMENTIA DATA FILE:
#CHANGE # AFTER "r" TO THE ROUND OF INTEREST
#NOTE: save() writes R's native serialized format, so the file gets an .RData extension rather than .dta
save(r11demclas, file = "~/R  HC/BMIN503_Final_Project/final final/NHATS_r11.RData") 

Once the dementia class (demclas) is identified, it is kept together with spid and rnd and saved as the object ‘r11demclas’.

3.3 Merging Datasets

#merged datasets (md). 
md1 <- left_join(df, df1, by = "spid")
md2 <- left_join(md1, df3, by = "spid")

# choose probable (1) and possible (2) dementia participants who live at home
dementia1 <- md2 |>
  filter(demclas %in% c("1", "2") & (r11dresid %in% c("1")))       # community-dwelling only

dementia2 <- md2 |>
  filter(demclas %in% c("1", "2") & (r11dresid %in% c("1", "2")))  # community or residential care

3.4 Preliminary Table Manipulation

Before creating the subset of the dataset for analysis, I will recode the variables. After reviewing the literature on dementia caregivers and social determinants of health, I selected the following variables.

Predictors: Caregiver-level factors were caregivers’ age, race, gender, self-reported income, and highest education level, each recoded as described below. The education level of the caregivers was categorized as “Less than high school (0)”, “High School (1)”, and “College or above (2).” For economic status, the caregivers’ reported income from the previous year was used. This study included both informal and formal support as part of the caregivers’ social determinants of health. Informal support included having friends or family (a) to talk to about important life matters, (b) to help with daily activities, such as running errands, and (c) to assist with care provision [10]. Formal support included (a) participation in a support group for caregivers, (b) access to respite services that allowed the caregiver to take time off, and (c) involvement in a training program that assisted the caregiver in providing care for the care recipient [10]. Each support item was answered by indicating whether or not the caregiver received that type of support.
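As a sketch of the support measures described above, the yes/no support items can be recoded to 0/1 indicators and summed into a per-caregiver support count. The variable names here (`talk_matters`, `help_errands`, `help_care`) are illustrative placeholders, not NSOC variable names:

```r
# Hypothetical toy data: three informal-support items answered 1 = yes, 2 = no
support <- data.frame(
  talk_matters = c(1, 2, 1),
  help_errands = c(2, 2, 1),
  help_care    = c(1, 1, 2)
)

# Recode the 1/2 answers to 1/0 indicators, then count supports per caregiver
support01 <- as.data.frame(lapply(support, function(x) ifelse(x == 1, 1, 0)))
support01$informal_support_count <- rowSums(support01)
support01$informal_support_count  # 2 1 2
```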

# Caregiver's Age #chd11dage

#Race
# Recode crl11dcgracehisp to create a new variable 'race_recode':
# 0 = White non-Hispanic, 1 = Black non-Hispanic, 2 = Other, 3 = Hispanic
dementia1 <- dementia1 |>
  mutate(
    race_recode = case_when(
      crl11dcgracehisp == 1 ~ 0,  # White, non-Hispanic
      crl11dcgracehisp == 2 ~ 1,  # Black, non-Hispanic
      crl11dcgracehisp == 3 ~ 2,  # Other
      crl11dcgracehisp == 4 ~ 3,  # Hispanic
      crl11dcgracehisp %in% c(5, 6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_             # Unhandled cases
    ))  

table(dementia1$crl11dcgracehisp)

  1   2   3   4   5   6 
250 193  16  46   1  22 
table(dementia1$race_recode)

  0   1   2   3 
250 193  16  46 
# Gender: Male as reference (0), Female as 1
dementia1 <- dementia1 |>
  mutate(
    gender_recode = case_when(
      as.character(c11gender) == "1" ~ 0,  # Male
      as.character(c11gender) == "2" ~ 1,  # Female
      TRUE ~ NA_real_  # Handle any other unexpected cases
    )
  )

# Education: Recoding education levels into two categories
table(dementia1$chd11educ)

 -8  -7  -6   1   2   3   4   5   6   7   8   9 
 13   1   2   1   5  36 132  35 101  50  83  69 
dementia1 <- dementia1 |>
  mutate(
    edu_recode = case_when(
      chd11educ %in% c(1, 2, 3) ~ 1,       # Below and high school diploma
      chd11educ %in% c(4, 5) ~ 1,          # Some college (grouped with high school)
      chd11educ %in% c(6, 7, 8) ~ 2,       # College and beyond
      chd11educ == 9 ~ 3,                  # Unknown
      chd11educ %in% c(-8, -7, -6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_                      # Unhandled cases
    )
  )
table(dementia1$edu_recode)

  1   2   3 
209 234  69 
# Marital Status: Recoding marital status into binary (married vs. not married)
dementia1 <- dementia1 |>
  mutate(
    martstat_recode = case_when(
      chd11martstat == 1 ~ 0,     # Married
      chd11martstat %in% 2:6 ~ 1, # Not married (single, divorced, etc.)
      chd11martstat %in% c(-8, -6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_                      # Unhandled cases
    )
  )

table(dementia1$martstat_recode)

  0   1 
204 232 
library(dplyr)
library(tibble)
# preparing dataset 
final_dementia <- dementia1 |> 
  mutate(
    race_recode = case_when(
      crl11dcgracehisp == 1 ~ "White, non-Hispanic",
      crl11dcgracehisp == 2 ~ "Black, non-Hispanic",
      crl11dcgracehisp == 3 ~ "Other",
      crl11dcgracehisp == 4 ~ "Hispanic",
      TRUE ~ NA_character_
    ),
    gender_recode = case_when(
      as.character(c11gender) == "1" ~ "Male",
      as.character(c11gender) == "2" ~ "Female",
      TRUE ~ NA_character_
    ),
    edu_recode = case_when(
      chd11educ %in% c(1, 2, 3) ~ "Below high school",
      chd11educ %in% c(4, 5) ~ "Some college",
      chd11educ %in% c(6, 7, 8) ~ "College and beyond",
      chd11educ == 9 ~ "Unknown",
      TRUE ~ NA_character_
    ),
    martstat_recode = case_when(
      chd11martstat == 1 ~ "Married",
      chd11martstat %in% 2:6 ~ "Not married",
      TRUE ~ NA_character_
    )
  )

Caregivers’ caregiving activities

#recoding: 1 (every day), 2 (most days), 3 (some days) --> 1; 4 (rarely), 5 (never) --> 0
#C11 CA1 HOW OFT HELP WITH CHORES (cca11hwoftchs)
#C11 CA2 HOW OFTEN SHOPPED FOR SP (cca11hwoftshp)
#C11 CA6 HOW OFT HELP PERS CARE (cca11hwoftpc)
#C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F)
#C11 CA7 HOW OFT HLP GTNG ARD HOME (cca11hwofthom)
#C11 CA9 HOW OFTEN DROVE SP (cca11hwoftdrv)
#C11 CA10 OFTN WENT ON OTH TRANSPR (cca11hwoftott)

library(dplyr)
dementia1 <- dementia1 |>
  mutate(
    cca11hwoftchs_recoded = case_when(
      cca11hwoftchs %in% c(1, 2, 3) ~ 1,
      cca11hwoftchs %in% c(4, 5) ~ 0,
      TRUE ~ NA_real_
    ),
    cca11hwoftshp_recoded = case_when(
      cca11hwoftshp %in% c(1, 2, 3) ~ 1,
      cca11hwoftshp %in% c(4, 5) ~ 0,
      TRUE ~ NA_real_
    ),
    cca11hwoftpc_recoded = case_when(
      cca11hwoftpc %in% c(1, 2, 3) ~ 1,
      cca11hwoftpc %in% c(4, 5) ~ 0,
      TRUE ~ NA_real_
    ),
    cca11hwofthom_recoded = case_when(
      cca11hwofthom %in% c(1, 2, 3) ~ 1,
      cca11hwofthom %in% c(4, 5) ~ 0,
      TRUE ~ NA_real_
    ),
    cca11hwoftdrv_recoded = case_when(
      cca11hwoftdrv %in% c(1, 2, 3) ~ 1,
      cca11hwoftdrv %in% c(4, 5) ~ 0,
      TRUE ~ NA_real_
    ),
    cca11hwoftott_recoded = case_when(
      cca11hwoftott %in% c(1, 2, 3) ~ 1,
      cca11hwoftott %in% c(4, 5) ~ 0,
      TRUE ~ NA_real_
    )
  )
  

#C11 CA3 EVER HELP ORDER MEDICINES (cca11hlpordmd) - yes 1 no 2-->0
#C11 CA5A GO ONLINE billing or banking (cca11hlpbnkng)
#C11 CA5A GO ONLINE SHOP FR GROC (cca11cmpgrcry)
#C11 CA5B GO ONLINE ORDER MEDS (cca11cmpordrx )
#C11 CA5C GO ONLINE BILLS BANKNG (cca11cmpbnkng )
#C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F) (cca11hlpteeth) 
#cdc11hlpyrpls 

dementia1 <- dementia1 |>
  mutate(across(
    c(cca11hlpordmd, cca11hlpbnkng, cca11cmpgrcry, 
      cca11cmpordrx, cca11cmpbnkng, cca11hlpteeth,cdc11hlpyrpls),
    ~ case_when(
        . == 1 ~ 1,
        . == 2 ~ 0,
        . %in% c(-8, -6, -1) ~ NA_real_,
        TRUE ~ NA_real_
      ),
    .names = "{.col}_recoded"
  ))

#caregiver’s caregiving activities:
#cca11hwoftchs #C11 CA1 HOW OFT HELP WITH CHORES
#cca11hwoftshp #C11 CA2 HOW OFTEN SHOPPED FOR SP
#cca11hwoftpc  #C11 CA6 HOW OFT HELP PERS CARE
#cca11hwofthom #C11 CA7 HOW OFT HLP GTNG ARD HOME
#cca11hwoftdrv #C11 CA9 HOW OFTEN DROVE SP
#cca11hwoftott #C11 CA10 OFTN WENT ON OTH TRANSPR

4 Caregiver’s Features

che11enrgylmt #Energy often limited
cac11diffphy  #Caregiver physical difficulty helping
cac11exhaustd #Caregiver exhausted at night
cac11toomuch  #Care more than can handle
cac11uroutchg #Care routine then changes
cac11notime   #No time for self (-8,-7,-6)
cac11diffemlv #Caregiver emotional difficulty (-1, 1-5)
cpp11hlpkptgo #Kept from going out (-6,-1, 1,2)
che11health   #General health (-8, 1-5)
che11sleepint #Interrupted sleep (-8,-7,-6, 1-5)
op11numhrsday #Number of hours per day help (-7,-1, 1-6)
op11numdaysmn #Number of days help per month (-7,-1, 1-6)

#Sociodemographic features (persons living with dementia and caregiver)
table(dementia1$cac11notime)

op11leveledu  #Caregiver education (na -1, 1-5)
cac11diffinc  #Caregiver financial difficulties (-8,-7,-6, binary 1,2)
ew11progneed1 #Persons living with dementia received food stamps (-8, -7, binary)
ew11finhlpfam #Persons living with dementia financial help from family (-8, -7, binary)
mc11havregdoc #Persons living with dementia have a regular doctor (1,2 binary)
hc11hosptstay #Persons living with dementia hospital stay in last 12 months (1,2)
hc11hosovrnht #Persons living with dementia number of hospital stays (-7, -1, 1-6 times)

Outcomes: Caregivers’ anxiety and depressive symptoms were each measured with two questions. First, anxiety was measured with the Generalized Anxiety Disorder-2 (GAD-2) scale, which consists of two questions. Since the NHATS data collection provided the GAD-2 items, this study used them to measure anxiety levels among caregivers. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6. Higher scores correspond to greater anxiety, with a total GAD-2 score of 3 or more indicating anxiety.

Caregivers’ depression was evaluated using the Patient Health Questionnaire-2 (PHQ-2) scale. Since the PHQ-2 items were also available, this study used them to measure depression among caregivers. Each item on the scale was measured on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6, with higher scores indicating more severe depression. A score of 3 has been identified as the optimal cutpoint when using the PHQ-2 to screen for depression: if the score is 3 or greater, major depressive disorder is likely.

# Sum the two questions for GAD2
dementia1$total_gad2 <- dementia1$che11fltnervs + dementia1$che11fltworry
# Recode the combined variable using a cut-off of 3
dementia1$gad2_cg_cat <- ifelse(dementia1$total_gad2 < 3, 0, 1)
 table(dementia1$gad2_cg_cat)

  0   1 
279 249 
 summary(dementia1$gad2_cg_cat) #1 ~ anxiety
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4716  1.0000  1.0000     251 
# Sum of the two questions for PHQ2 (che11fltltlin + che11fltdown) 
dementia1$total_phq2 <- dementia1$che11fltltlin+ dementia1$che11fltdown
#Recode the combined variable using a cut-off of 3
dementia1$phq2_cg_cat <- ifelse(dementia1$total_phq2 < 3, 0, 1)
 table(dementia1$phq2_cg_cat)

  0   1 
276 252 
 summary(dementia1$phq2_cg_cat) #1 ~ depression
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4773  1.0000  1.0000     251 

Data analysis

For data analysis, we first conducted descriptive analyses, including means, standard deviations, ranges, and percentages, to summarize the dataset. To investigate how caregivers’ social strains and caregiver-level factors influence caregiver depression, we performed logistic regression analyses. Guided by the conceptual framework of this study, univariate logistic regression analyses were employed to identify caregivers’ social strains and caregiver-level factors significantly associated with caregiver anxiety and depression, controlling for care recipient-level factors. Variables with a p-value below 0.05 in the univariate analyses were included in the subsequent multivariate logistic regression model. The multivariate model was then constructed to determine which factors most strongly influenced caregiver anxiety and depression. All statistical analyses were conducted using R, with statistical significance set at a p-value of less than 0.05.
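The screening-then-modeling workflow described above can be sketched as follows. This is an illustrative outline rather than the exact analysis code; `screen_predictors()` is a hypothetical helper, and the data frame and variable names in the usage comments are placeholders:

```r
# Illustrative sketch: univariate logistic screening at alpha = 0.05,
# returning the predictors that qualify for the multivariate model.
screen_predictors <- function(dat, outcome, candidates, alpha = 0.05) {
  keep <- c()
  for (v in candidates) {
    fit <- glm(reformulate(v, response = outcome), data = dat, family = binomial)
    p <- coef(summary(fit))[2, "Pr(>|z|)"]   # p-value for the single predictor
    if (!is.na(p) && p < alpha) keep <- c(keep, v)
  }
  keep
}

# Hypothetical usage with this project's variables:
# sig <- screen_predictors(dementia_subset, "gad2_cg_cat",
#                          c("cac11exhaustd", "che11health", "cac11notime"))
# glm(reformulate(sig, response = "gad2_cg_cat"),
#     data = dementia_subset, family = binomial)  # multivariate model
```

A caveat worth noting: with a multi-level categorical predictor, `coef(summary(fit))[2, ]` captures only the first contrast, so a likelihood-ratio test per variable would be a more careful screen.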

4.1 Preparing the Analytic Subset

#choosing only one caregiver for each participant
final <- dementia1 |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()

#total 563 caregivers 
#creating subset to work more effectively
library(dplyr)
selected_vars <- c(
  "spid","demclas", "cca11hwoftchs", "cca11hwoftshp", "cca11hwoftpc",
  "cca11hwofthom", "cca11hwoftdrv", "cca11hwoftott",
  "che11enrgylmt", "cac11diffphy", "cac11exhaustd",
  "cac11toomuch", "cac11uroutchg", "cac11notime",
  "cac11diffemlv", "cpp11hlpkptgo", "che11health", 
  "che11sleepint", 
   "ew11progneed1", "ew11finhlpfam", 
  "mc11havregdoc", "hc11hosptstay", "hc11hosovrnht", 
  "cac11diffinc", "pa11hlkepfvst", "pa11hlkpfrclb", "pa11hlkpgoenj", "pa11hlkpfrwrk", "pa11hlkpfrvol", "pa11prcranoth", "race_recode", "gender_recode", "edu_recode", "chd11dage", "martstat_recode", "phq2_cg_cat", "gad2_cg_cat"
)

dementia_subset <- dementia1 |> select(all_of(selected_vars))
head(dementia_subset)
# A tibble: 6 × 37
      spid demclas cca11hwoftchs    cca11hwoftshp    cca11hwoftpc  cca11hwofthom
     <dbl>   <dbl> <dbl+lbl>        <dbl+lbl>        <dbl+lbl>     <dbl+lbl>    
1 10000036       1 NA               NA               NA            NA           
2 10000041       2  1 [1 EVERY DAY]  1 [1 EVERY DAY]  1 [1 EVERY …  1 [1 EVERY …
3 10000041       2  5 [5 NEVER]      4 [4 RARELY]     5 [5 NEVER]   3 [3 SOME D…
4 10000051       1  1 [1 EVERY DAY]  3 [3 SOME DAYS]  2 [2 MOST D…  3 [3 SOME D…
5 10000064       1  2 [2 MOST DAYS]  5 [5 NEVER]      3 [3 SOME D…  3 [3 SOME D…
6 10000064       1  1 [1 EVERY DAY]  1 [1 EVERY DAY]  5 [5 NEVER]   3 [3 SOME D…
# ℹ 31 more variables: cca11hwoftdrv <dbl+lbl>, cca11hwoftott <dbl+lbl>,
#   che11enrgylmt <dbl+lbl>, cac11diffphy <dbl+lbl>, cac11exhaustd <dbl+lbl>,
#   cac11toomuch <dbl+lbl>, cac11uroutchg <dbl+lbl>, cac11notime <dbl+lbl>,
#   cac11diffemlv <dbl+lbl>, cpp11hlpkptgo <dbl+lbl>, che11health <dbl+lbl>,
#   che11sleepint <dbl+lbl>, ew11progneed1 <dbl+lbl>, ew11finhlpfam <dbl+lbl>,
#   mc11havregdoc <dbl+lbl>, hc11hosptstay <dbl+lbl>, hc11hosovrnht <dbl+lbl>,
#   cac11diffinc <dbl+lbl>, pa11hlkepfvst <dbl+lbl>, pa11hlkpfrclb <dbl+lbl>, …
# Select only numeric columns
numeric_dementia1 <- dementia_subset |> select(where(is.numeric))

# Compute the correlation matrix
cor_matrix <- cor(numeric_dementia1, method = "kendall", use = "pairwise.complete.obs")
# Filter correlation pairs above a 0.8 threshold (the `< 1` bound is meant to
# exclude self-pairs, though rounding lets a few through, as seen below)
high_corr <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_pairs <- data.frame(
  Var1 = rownames(cor_matrix)[high_corr[, 1]],
  Var2 = colnames(cor_matrix)[high_corr[, 2]],
  Correlation = cor_matrix[high_corr]
)
print(high_corr_pairs)
           Var1          Var2 Correlation
1 cca11hwoftdrv cca11hwoftdrv   1.0000000
2 cca11hwoftott cca11hwoftott   1.0000000
3 hc11hosptstay hc11hosptstay   1.0000000
4 hc11hosovrnht hc11hosptstay  -0.9391286
5 hc11hosptstay hc11hosovrnht  -0.9391286
6 pa11prcranoth pa11prcranoth   1.0000000
7 gender_recode gender_recode   1.0000000
# Load pheatmap (run install.packages("pheatmap") first if it is not installed)
library(pheatmap)

# Create the heatmap
pheatmap(cor_matrix, color = colorRampPalette(c("blue", "white", "red"))(50))

Since my outcomes are GAD-2 and PHQ-2, I am most focused on the variables clustered around them in the heatmap: che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, and ew11progneed1.

The Kendall correlation matrix was presented as a heatmap to visually assess the data and evaluate the independence of variables. The correlation coefficient quantifies the association between two variables, with values ranging from −1 to +1: a coefficient of 0 indicates no correlation, while negative and positive values represent negative and positive associations, respectively. Pairs with an absolute correlation above 0.8 were flagged as redundant, and I then selected the most representative variable of each pair to reduce conceptual redundancy among highly correlated items. The negative values in the dataset were not addressed in this step.
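The pair-dropping step described above can also be automated. This is a minimal sketch; the function name `prune_correlated` and the toy matrix are illustrative, not part of the analysis code:

```r
# Sketch: automatically drop one variable from each highly correlated pair,
# keeping the first-listed variable of the pair as the representative.
prune_correlated <- function(cor_matrix, cutoff = 0.8) {
  drop <- character(0)
  vars <- colnames(cor_matrix)
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      # only look above the diagonal, and skip variables already dropped
      if (j > i && !(vars[i] %in% drop) && abs(cor_matrix[i, j]) > cutoff) {
        drop <- union(drop, vars[j])
      }
    }
  }
  setdiff(vars, drop)
}

# Toy example: a and b are near-duplicates, c is independent.
m <- matrix(c(1, 0.94, 0.1,
              0.94, 1, 0.2,
              0.1, 0.2, 1), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
prune_correlated(m)   # keeps "a" and "c", drops "b"
```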

4.1.1 Imputation

The original dataset includes numerous negative (missing-data) codes and N/A values, and the sample size is small, so the data required preprocessing before feature selection. Rather than dropping incomplete rows, negative codes were first converted to N/A. These N/A values were then imputed according to feature type: for continuous variables, N/A values were replaced with the median of the respective column, while the most frequent category level was used to impute N/A values for categorical features:

che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, ew11progneed1.

# converting negative codes to NA
dementia_subset[dementia_subset < 0] <- NA

# continuous variables: impute NA with the column median
continuous_vars <- sapply(dementia_subset, is.numeric)
dementia_subset[continuous_vars] <- lapply(dementia_subset[continuous_vars],
  function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

# categorical variables: impute NA with the most frequent level
categorical_vars <- sapply(dementia_subset, is.factor)
dementia_subset[categorical_vars] <- lapply(dementia_subset[categorical_vars], function(x) {
  mode_level <- names(sort(table(x), decreasing = TRUE))[1]
  x[is.na(x)] <- mode_level
  x
})
#final cleaning for dataset
final_dementia <- dementia_subset |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()

#creating new dataset for each outcome
anxiety <- subset(
  final_dementia,
  select = c(gad2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott, che11enrgylmt, 
             cac11diffphy, cac11exhaustd, cac11diffemlv, che11health, che11sleepint, ew11progneed1, race_recode, gender_recode, edu_recode, 
             chd11dage, martstat_recode)
)

depression <- subset(
  final_dementia,
  select = c(phq2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott, che11enrgylmt, 
             cac11diffphy, cac11exhaustd, cac11diffemlv, che11health, che11sleepint, ew11progneed1, race_recode, gender_recode, edu_recode, 
             chd11dage, martstat_recode)
)

4.2 Results

4.2.1 Characteristics of Participants

# Summary and Visualization of 'demclas'
table(final$demclas) # 1 = probable dementia, 2 = possible dementia

  1   2 
331 232 
# Adding labels for clarity
final$demclas_label <- factor(
  final$demclas, 
  levels = c(1, 2), 
  labels = c("Probable Dementia", "Possible Dementia")
)

# Plotting the distribution
ggplot(final, aes(x = demclas_label)) + 
  geom_bar(
    color = "grey", 
    fill = "grey", 
    alpha = 0.7
  ) +
  labs(
    title = "Distribution of Dementia Classifications",
    x = "Dementia Class",
    y = "Count"
  ) +
  theme_minimal()

4.2.1.1 Graph

# Dementia class analyses

#labeling variables.
final1 <- final_dementia |> 
  mutate(
    race_recode = factor(race_recode, levels = c("0", "1", "2", "3"),
                         labels = c("White, non-Hispanic", "Black, non-Hispanic", "Other", "Hispanic")),
    gender_recode = factor(gender_recode, levels = c("0", "1"),
                           labels = c("Male", "Female")),
    edu_recode = factor(edu_recode, levels = c("1", "2", "3" ),
                        labels = c("Below high school", "Some college", "College and beyond")),
    martstat_recode = factor(martstat_recode, levels = c("0", "1"),
                             labels = c("Not Married", "Married"))
  )

# 1. Bar Plot for Categorical Variables (e.g., Race, Gender, Education, Marital Status)
# Race
ggplot(final1, aes(x = race_recode, fill = factor(demclas))) + 
  geom_bar() +  # default bar plot based on counts
  labs(title = "Race Distribution by Dementia Class",
       x = "Race",
       y = "Count",
       fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# Gender - Count plot
ggplot(final1, aes(x = gender_recode, fill = factor(demclas))) + 
  geom_bar() +
  labs(title = "Gender Distribution by Dementia Class",
       x = "Gender",
       y = "Count",
       fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# Education - Count plot
ggplot(final1, aes(x = edu_recode, fill = factor(demclas))) + 
  geom_bar() +
  labs(title = "Education Distribution by Dementia Class",
       x = "Education",
       y = "Count",
       fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# Marital Status - Count plot
ggplot(final1, aes(x = martstat_recode, fill = factor(demclas))) + 
  geom_bar() +
  labs(title = "Marital Status Distribution by Dementia Class",
       x = "Marital Status",
       y = "Count",
       fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# 2. Boxplot for Continuous Variable (Age)
ggplot(final1, aes(x = factor(demclas), y = chd11dage, fill = factor(demclas))) +
  geom_boxplot() +
  labs(title = "Age Distribution by Dementia Class",
       x = "Dementia Class",
       y = "Age",
       fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

library(gtsummary)
#making table
summary_table <- final1 |>
  select(race_recode, gender_recode, edu_recode, martstat_recode, chd11dage) |>
  tbl_summary(
    statistic = list(all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} ({p}%)"),
    label = list(
      race_recode ~ "Race",
      gender_recode ~ "Gender",
      edu_recode ~ "Education",
      martstat_recode ~ "Marital Status",
      chd11dage ~ "Age"
    ),
    missing_text = "(Missing)"
  ) |>
  modify_header(label ~ "**Variable**") |>  
  bold_labels() 


summary_table
Variable N = 563¹
Race
    White, non-Hispanic 152 (27%)
    Black, non-Hispanic 373 (66%)
    Other 9 (1.6%)
    Hispanic 29 (5.2%)
Gender
    Male 96 (17%)
    Female 467 (83%)
Education
    Below high school 133 (24%)
    Some college 394 (70%)
    College and beyond 36 (6.4%)
Marital Status
    Not Married 103 (18%)
    Married 460 (82%)
Age 65 (8)
1 n (%); Mean (SD)

This exploratory study employs multiple analytic techniques, including correlation matrix analysis, generalized linear models (GLM), and random forests (RF), to identify key predictors of caregiver depression and anxiety. This multipronged approach suits the diverse data types in this study. Machine learning methods, being inductive, support hypothesis generation and allow systematic feature reduction by excluding variables deemed unimportant across multiple methods. This refines the feature set, enhancing the interpretability and predictive accuracy of the models.
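As a minimal sketch of this cross-method feature reduction, a variable can be retained only when both methods rank it highly. The scores below are illustrative values drawn from the model outputs reported in this section, and the top-k rule is a simplification of the approach described above:

```r
# Hypothetical consensus feature selection across two methods:
# keep only variables ranked in the top k by BOTH methods.
glm_pvals <- c(cac11exhaustd = 1e-10, che11health = 0.04,
               chd11dage = 0.88, cca11hwoftshp = 0.51)
rf_gini   <- c(cac11exhaustd = 26.2, chd11dage = 19.4,
               che11health = 10.9, cca11hwoftshp = 6.7)

k <- 2
top_glm <- names(sort(glm_pvals))[1:k]                    # smallest p-values
top_rf  <- names(sort(rf_gini, decreasing = TRUE))[1:k]   # largest Gini decrease
consensus <- intersect(top_glm, top_rf)
print(consensus)  # variables important under both methods
```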

4.2.2 Logistic regression

# Logistic regression for outcome gad2_cg_cat
model_gad2 <- glm(
  gad2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + che11enrgylmt + 
    cac11diffphy + cac11exhaustd + cac11diffemlv + che11health + che11sleepint + 
    ew11progneed1 + race_recode + gender_recode + edu_recode + chd11dage + 
    martstat_recode, 
  data = anxiety, 
  family = binomial(link = "logit")
)
summary(model_gad2)

Call:
glm(formula = gad2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + 
    che11enrgylmt + cac11diffphy + cac11exhaustd + cac11diffemlv + 
    che11health + che11sleepint + ew11progneed1 + race_recode + 
    gender_recode + edu_recode + chd11dage + martstat_recode, 
    family = binomial(link = "logit"), data = anxiety)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      7.055252   2.225233   3.171  0.00152 ** 
cca11hwoftchs    0.198116   0.133486   1.484  0.13776    
cca11hwoftshp   -0.096310   0.145559  -0.662  0.50819    
cca11hwoftott    0.203160   0.220421   0.922  0.35669    
che11enrgylmt   -0.082552   0.254866  -0.324  0.74601    
cac11diffphy    -0.978454   0.358695  -2.728  0.00638 ** 
cac11exhaustd   -1.419467   0.215747  -6.579 4.73e-11 ***
cac11diffemlv   -0.015984   0.182646  -0.088  0.93026    
che11health      0.331630   0.161528   2.053  0.04006 *  
che11sleepint   -0.521428   0.173869  -2.999  0.00271 ** 
ew11progneed1    0.460358   0.413274   1.114  0.26531    
race_recode     -0.564930   0.181982  -3.104  0.00191 ** 
gender_recode   -0.771277   0.300930  -2.563  0.01038 *  
edu_recode      -0.630911   0.233963  -2.697  0.00700 ** 
chd11dage       -0.002079   0.013950  -0.149  0.88151    
martstat_recode -0.869216   0.339016  -2.564  0.01035 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 660.67  on 562  degrees of freedom
Residual deviance: 411.90  on 547  degrees of freedom
AIC: 443.9

Number of Fisher Scoring iterations: 5
# Logistic regression for outcome phq2_cg_cat
model_phq2 <- glm(
  phq2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + che11enrgylmt + 
    cac11diffphy + cac11exhaustd + cac11diffemlv + che11health + che11sleepint + 
    ew11progneed1 + race_recode + gender_recode + edu_recode + chd11dage + 
    martstat_recode, 
  data = depression, 
  family = binomial(link = "logit")
)
summary(model_phq2)

Call:
glm(formula = phq2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + 
    che11enrgylmt + cac11diffphy + cac11exhaustd + cac11diffemlv + 
    che11health + che11sleepint + ew11progneed1 + race_recode + 
    gender_recode + edu_recode + chd11dage + martstat_recode, 
    family = binomial(link = "logit"), data = depression)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      6.010701   2.121163   2.834 0.004602 ** 
cca11hwoftchs   -0.026695   0.131499  -0.203 0.839129    
cca11hwoftshp   -0.001903   0.143780  -0.013 0.989440    
cca11hwoftott    0.092992   0.205062   0.453 0.650202    
che11enrgylmt   -0.238207   0.242829  -0.981 0.326608    
cac11diffphy    -0.755139   0.347061  -2.176 0.029569 *  
cac11exhaustd   -1.244365   0.204329  -6.090 1.13e-09 ***
cac11diffemlv    0.138485   0.174980   0.791 0.428691    
che11health      0.338219   0.156592   2.160 0.030782 *  
che11sleepint   -0.294320   0.161730  -1.820 0.068787 .  
ew11progneed1    0.271095   0.380739   0.712 0.476452    
race_recode     -0.640039   0.176302  -3.630 0.000283 ***
gender_recode   -0.801365   0.289794  -2.765 0.005687 ** 
edu_recode      -0.725094   0.228080  -3.179 0.001477 ** 
chd11dage        0.002991   0.013403   0.223 0.823433    
martstat_recode -0.680916   0.330734  -2.059 0.039514 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 652.71  on 562  degrees of freedom
Residual deviance: 445.84  on 547  degrees of freedom
AIC: 477.84

Number of Fisher Scoring iterations: 5
# Calculate odds ratios and confidence intervals for gad2_cg_cat model
odds_ratios_gad2 <- exp(coef(model_gad2))
conf_intervals_gad2 <- exp(confint(model_gad2))
Waiting for profiling to be done...
# Combine into a tidy table for gad2_cg_cat
results_gad2 <- data.frame(
  Variable = names(odds_ratios_gad2),
  Odds_Ratio = odds_ratios_gad2,
  CI_Lower = conf_intervals_gad2[, 1],
  CI_Upper = conf_intervals_gad2[, 2]
)
print("Odds Ratios for gad2_cg_cat Model:")
[1] "Odds Ratios for gad2_cg_cat Model:"
print(results_gad2)
                       Variable   Odds_Ratio   CI_Lower     CI_Upper
(Intercept)         (Intercept) 1158.9291601 15.5110504 9.780599e+04
cca11hwoftchs     cca11hwoftchs    1.2191040  0.9381851 1.585724e+00
cca11hwoftshp     cca11hwoftshp    0.9081825  0.6829866 1.210961e+00
cca11hwoftott     cca11hwoftott    1.2252679  0.8096916 1.926016e+00
che11enrgylmt     che11enrgylmt    0.9207632  0.5554478 1.514254e+00
cac11diffphy       cac11diffphy    0.3758918  0.1852315 7.595119e-01
cac11exhaustd     cac11exhaustd    0.2418430  0.1564642 3.654413e-01
cac11diffemlv     cac11diffemlv    0.9841432  0.6884434 1.412463e+00
che11health         che11health    1.3932377  1.0204185 1.925540e+00
che11sleepint     che11sleepint    0.5936722  0.4193536 8.313417e-01
ew11progneed1     ew11progneed1    1.5846416  0.7227122 3.671437e+00
race_recode         race_recode    0.5684000  0.3960653 8.100796e-01
gender_recode     gender_recode    0.4624220  0.2563228 8.360574e-01
edu_recode           edu_recode    0.5321066  0.3356476 8.416502e-01
chd11dage             chd11dage    0.9979228  0.9707682 1.025558e+00
martstat_recode martstat_recode    0.4192802  0.2150644 8.152782e-01
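As a quick sanity check on the table above, the odds ratio is simply the exponentiated coefficient from the earlier model summary; taking the cac11exhaustd row (estimate −1.419, SE 0.216), a Wald-type interval can be recomputed by hand. It will differ slightly from the profile-likelihood interval that `confint()` reports:

```r
# Hand computation of the odds ratio and a Wald 95% CI for cac11exhaustd,
# using the estimate and SE printed in the gad2 model summary above.
est <- -1.419467
se  <-  0.215747

or      <- exp(est)                         # ~0.242, matching the table
wald_ci <- exp(est + c(-1.96, 1.96) * se)   # Wald CI; confint() uses profiling
round(c(OR = or, lower = wald_ci[1], upper = wald_ci[2]), 3)
```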
# Calculate odds ratios and confidence intervals for phq2_cg_cat model
odds_ratios_phq2 <- exp(coef(model_phq2))
conf_intervals_phq2 <- exp(confint(model_phq2))
Waiting for profiling to be done...
# Combine into a tidy table for phq2_cg_cat
results_phq2 <- data.frame(
  Variable = names(odds_ratios_phq2),
  Odds_Ratio = odds_ratios_phq2,
  CI_Lower = conf_intervals_phq2[, 1],
  CI_Upper = conf_intervals_phq2[, 2]
)
print("Odds Ratios for phq2_cg_cat Model:")
[1] "Odds Ratios for phq2_cg_cat Model:"
print(results_phq2)
                       Variable  Odds_Ratio  CI_Lower     CI_Upper
(Intercept)         (Intercept) 407.7690755 6.5860233 2.755568e+04
cca11hwoftchs     cca11hwoftchs   0.9736579 0.7502830 1.257981e+00
cca11hwoftshp     cca11hwoftshp   0.9980988 0.7542048 1.327802e+00
cca11hwoftott     cca11hwoftott   1.0974528 0.7423853 1.666580e+00
che11enrgylmt     che11enrgylmt   0.7880393 0.4846498 1.261420e+00
cac11diffphy       cac11diffphy   0.4699451 0.2378984 9.313639e-01
cac11exhaustd     cac11exhaustd   0.2881238 0.1910987 4.267062e-01
cac11diffemlv     cac11diffemlv   1.1485325 0.8172844 1.627034e+00
che11health         che11health   1.4024482 1.0377242 1.920274e+00
che11sleepint     che11sleepint   0.7450383 0.5409786 1.021993e+00
ew11progneed1     ew11progneed1   1.3113991 0.6348716 2.838612e+00
race_recode         race_recode   0.5272717 0.3705401 7.411819e-01
gender_recode     gender_recode   0.4487161 0.2541691 7.934679e-01
edu_recode           edu_recode   0.4842792 0.3086426 7.561883e-01
chd11dage             chd11dage   1.0029951 0.9768238 1.029731e+00
martstat_recode martstat_recode   0.5061534 0.2643094 9.698317e-01
library(broom)
tidy_gad2 <- tidy(model_gad2, exponentiate = TRUE, conf.int = TRUE)
tidy_phq2 <- tidy(model_phq2, exponentiate = TRUE, conf.int = TRUE)

print("Tidy Results for gad2_cg_cat Model:")
[1] "Tidy Results for gad2_cg_cat Model:"
print(tidy_gad2)
# A tibble: 16 × 7
   term            estimate std.error statistic  p.value conf.low conf.high
   <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
 1 (Intercept)     1159.       2.23      3.17   1.52e- 3   15.5   97806.   
 2 cca11hwoftchs      1.22     0.133     1.48   1.38e- 1    0.938     1.59 
 3 cca11hwoftshp      0.908    0.146    -0.662  5.08e- 1    0.683     1.21 
 4 cca11hwoftott      1.23     0.220     0.922  3.57e- 1    0.810     1.93 
 5 che11enrgylmt      0.921    0.255    -0.324  7.46e- 1    0.555     1.51 
 6 cac11diffphy       0.376    0.359    -2.73   6.38e- 3    0.185     0.760
 7 cac11exhaustd      0.242    0.216    -6.58   4.73e-11    0.156     0.365
 8 cac11diffemlv      0.984    0.183    -0.0875 9.30e- 1    0.688     1.41 
 9 che11health        1.39     0.162     2.05   4.01e- 2    1.02      1.93 
10 che11sleepint      0.594    0.174    -3.00   2.71e- 3    0.419     0.831
11 ew11progneed1      1.58     0.413     1.11   2.65e- 1    0.723     3.67 
12 race_recode        0.568    0.182    -3.10   1.91e- 3    0.396     0.810
13 gender_recode      0.462    0.301    -2.56   1.04e- 2    0.256     0.836
14 edu_recode         0.532    0.234    -2.70   7.00e- 3    0.336     0.842
15 chd11dage          0.998    0.0140   -0.149  8.82e- 1    0.971     1.03 
16 martstat_recode    0.419    0.339    -2.56   1.03e- 2    0.215     0.815
print("Tidy Results for phq2_cg_cat Model:")
[1] "Tidy Results for phq2_cg_cat Model:"
print(tidy_phq2)
# A tibble: 16 × 7
   term            estimate std.error statistic       p.value conf.low conf.high
   <chr>              <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
 1 (Intercept)      408.       2.12      2.83   0.00460          6.59  27556.   
 2 cca11hwoftchs      0.974    0.131    -0.203  0.839            0.750     1.26 
 3 cca11hwoftshp      0.998    0.144    -0.0132 0.989            0.754     1.33 
 4 cca11hwoftott      1.10     0.205     0.453  0.650            0.742     1.67 
 5 che11enrgylmt      0.788    0.243    -0.981  0.327            0.485     1.26 
 6 cac11diffphy       0.470    0.347    -2.18   0.0296           0.238     0.931
 7 cac11exhaustd      0.288    0.204    -6.09   0.00000000113    0.191     0.427
 8 cac11diffemlv      1.15     0.175     0.791  0.429            0.817     1.63 
 9 che11health        1.40     0.157     2.16   0.0308           1.04      1.92 
10 che11sleepint      0.745    0.162    -1.82   0.0688           0.541     1.02 
11 ew11progneed1      1.31     0.381     0.712  0.476            0.635     2.84 
12 race_recode        0.527    0.176    -3.63   0.000283         0.371     0.741
13 gender_recode      0.449    0.290    -2.77   0.00569          0.254     0.793
14 edu_recode         0.484    0.228    -3.18   0.00148          0.309     0.756
15 chd11dage          1.00     0.0134    0.223  0.823            0.977     1.03 
16 martstat_recode    0.506    0.331    -2.06   0.0395           0.264     0.970

4.2.3 GLM

4.2.3.1 Anxiety

#creating training/testing data; seed set first so the split is reproducible
set.seed(123)
anxiety_split <- initial_split(anxiety, prop = 0.80)
anxiety_split
<Training/Testing/Total>
<450/113/563>
anxiety_train <- training(anxiety_split)
anxiety_test <- testing(anxiety_split)

#glm
lr_class_spec <- logistic_reg() |>
  set_engine("glm")

anxiety_train$gad2_cg_cat <- as.factor(anxiety_train$gad2_cg_cat)
anxiety_test$gad2_cg_cat <- as.factor(anxiety_test$gad2_cg_cat)

lr_class_fit <- lr_class_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

## top variables
significant_predictors <- tidy(lr_class_fit$fit) |>
  filter(p.value < 0.05)
print(significant_predictors)
# A tibble: 8 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        5.27      2.45       2.15 3.15e- 2
2 cac11diffphy      -0.831     0.416     -2.00 4.58e- 2
3 cac11exhaustd     -1.53      0.250     -6.11 9.73e-10
4 che11health        0.396     0.188      2.10 3.56e- 2
5 che11sleepint     -0.518     0.199     -2.60 9.46e- 3
6 race_recode       -0.503     0.199     -2.53 1.16e- 2
7 edu_recode        -0.625     0.278     -2.25 2.42e- 2
8 martstat_recode   -1.16      0.402     -2.88 3.99e- 3
### 
anxiety_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(gad2_cg_cat ~ .)


# Fit the workflow to the training data
anxiety_glm_fit <- anxiety_glm_wf |> 
  fit(data = anxiety_train)

# Generate predictions with probabilities
anxiety_glm_predicted <- predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")

anxiety_glm_predicted 
# A tibble: 113 × 2
    .pred_0 .pred_1
      <dbl>   <dbl>
 1 0.989     0.0114
 2 0.000162  1.00  
 3 0.989     0.0114
 4 0.817     0.183 
 5 0.936     0.0639
 6 0.0177    0.982 
 7 0.989     0.0114
 8 0.989     0.0114
 9 0.911     0.0891
10 0.702     0.298 
# ℹ 103 more rows
# Combine into a single data frame
anxiety_glm_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,  # Actual values of the outcome variable
  predict(anxiety_glm_fit, new_data = anxiety_test),  # Predicted class labels
  predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)

print(anxiety_glm_pred_values)
# A tibble: 113 × 4
   truth .pred_class  .pred_0 .pred_1
   <fct> <fct>          <dbl>   <dbl>
 1 0     0           0.989     0.0114
 2 1     1           0.000162  1.00  
 3 0     0           0.989     0.0114
 4 1     0           0.817     0.183 
 5 0     0           0.936     0.0639
 6 1     1           0.0177    0.982 
 7 0     0           0.989     0.0114
 8 0     0           0.989     0.0114
 9 0     0           0.911     0.0891
10 0     0           0.702     0.298 
# ℹ 103 more rows
#cross validation  20
anxiety_folds<-vfold_cv(anxiety_train, v=20)

glm_workflow<-workflow()|>
  add_model(lr_class_spec)|>
  add_formula(gad2_cg_cat ~ .)

glm_fit_cv<-glm_workflow|>
  fit_resamples(anxiety_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.822    20  0.0172 Preprocessor1_Model1
2 brier_class binary     0.128    20  0.0113 Preprocessor1_Model1
3 roc_auc     binary     0.867    20  0.0204 Preprocessor1_Model1
#cross-validated roc_auc = 0.867

anxiety_glm_cv_preds<-collect_predictions(glm_fit_cv)
anxiety_glm_cv_preds |>
  group_by(id) |>
  roc_curve(truth = gad2_cg_cat, .pred_0) |>
  autoplot()

#Prediction on the test data
anxiety.lr.pred.values.test <-  bind_cols(
  truth = anxiety_test$gad2_cg_cat,
  predict(lr_class_fit, anxiety_test),
  predict(lr_class_fit, anxiety_test, type = "prob")
)
anxiety.lr.pred.values.test
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.933  0.0668
 2 1     1             0.209  0.791 
 3 0     0             0.933  0.0668
 4 1     1             0.244  0.756 
 5 0     0             0.667  0.333 
 6 1     0             0.890  0.110 
 7 0     0             0.933  0.0668
 8 0     0             0.933  0.0668
 9 0     0             0.912  0.0883
10 0     1             0.456  0.544 
# ℹ 103 more rows
#Plot of ROC curve of prediction on test results
autoplot(roc_curve(anxiety.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(anxiety.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.841
2 kap         binary         0.564
3 mn_log_loss binary         0.443
4 roc_auc     binary         0.853
#roc_auc 0.853 on the test data

4.2.3.2 Depression

#creating training/testing data; seed set first so the split is reproducible
set.seed(123)
depression_split <- initial_split(depression, prop = 0.80)
depression_split
<Training/Testing/Total>
<450/113/563>
depression_train <- training(depression_split)
depression_test <- testing(depression_split)

#glm
lr_class_spec <- logistic_reg() |>
  set_engine("glm")

depression_train$phq2_cg_cat <- as.factor(depression_train$phq2_cg_cat)
depression_test$phq2_cg_cat <- as.factor(depression_test$phq2_cg_cat)

lr_class_fit <- lr_class_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

## top variables
significant_predictors_depression <- tidy(lr_class_fit$fit) |>
  filter(p.value < 0.05)
print(significant_predictors_depression)
# A tibble: 4 × 5
  term          estimate std.error statistic     p.value
  <chr>            <dbl>     <dbl>     <dbl>       <dbl>
1 cac11exhaustd   -1.22      0.237     -5.16 0.000000252
2 race_recode     -0.601     0.196     -3.07 0.00213    
3 gender_recode   -0.658     0.335     -1.97 0.0494     
4 edu_recode      -0.632     0.246     -2.57 0.0101     
### 
depression_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(phq2_cg_cat ~ .)

# Fit the workflow to the training data
depression_glm_fit <- depression_glm_wf |> 
  fit(data = depression_train)

# Generate predictions with probabilities
depression_glm_predicted <- predict(depression_glm_fit, new_data = depression_test, type = "prob")

depression_glm_predicted 
# A tibble: 113 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1  0.0864  0.914 
 2  0.0417  0.958 
 3  0.958   0.0417
 4  0.182   0.818 
 5  0.947   0.0526
 6  0.947   0.0526
 7  0.947   0.0526
 8  0.0141  0.986 
 9  0.947   0.0526
10  0.947   0.0526
# ℹ 103 more rows
# Combine into a single data frame
depression_glm_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,  # Actual values of the outcome variable
  predict(depression_glm_fit, new_data = depression_test),  # Predicted class labels
  predict(depression_glm_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)

print(depression_glm_pred_values)
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 1     1            0.0864  0.914 
 2 1     1            0.0417  0.958 
 3 0     0            0.958   0.0417
 4 1     1            0.182   0.818 
 5 0     0            0.947   0.0526
 6 0     0            0.947   0.0526
 7 0     0            0.947   0.0526
 8 1     1            0.0141  0.986 
 9 0     0            0.947   0.0526
10 0     0            0.947   0.0526
# ℹ 103 more rows
#cross validation  20
depression_folds<-vfold_cv(depression_train, v=20)

depression_glm_wf <- workflow ()|>
  add_model(lr_class_spec)|>
  add_formula(phq2_cg_cat ~ .)

depression_glm_fit_cv<-depression_glm_wf |>
  fit_resamples(depression_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(depression_glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.789    20  0.0225 Preprocessor1_Model1
2 brier_class binary     0.144    20  0.0109 Preprocessor1_Model1
3 roc_auc     binary     0.829    20  0.0279 Preprocessor1_Model1
#cross-validated roc_auc = 0.829

depression_glm_cv_preds<-collect_predictions(depression_glm_fit_cv)
depression_glm_cv_preds |>
  group_by(id) |>
  roc_curve(truth = phq2_cg_cat, .pred_0) |>
  autoplot()

#Prediction on the test data
depression.lr.pred.values.test <-  bind_cols(
  truth = depression_test$phq2_cg_cat,
  predict(lr_class_fit, depression_test),
  predict(lr_class_fit, depression_test, type = "prob")
)
depression.lr.pred.values.test
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 1     0             0.805  0.195 
 2 1     1             0.280  0.720 
 3 0     0             0.841  0.159 
 4 1     0             0.846  0.154 
 5 0     0             0.918  0.0817
 6 0     0             0.918  0.0817
 7 0     0             0.918  0.0817
 8 1     1             0.234  0.766 
 9 0     0             0.918  0.0817
10 0     0             0.918  0.0817
# ℹ 103 more rows
#Plot of ROC curve of prediction on test results
autoplot(roc_curve(depression.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(depression.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.796
2 kap         binary         0.485
3 mn_log_loss binary         0.407
4 roc_auc     binary         0.884
#roc_auc 0.8837

4.2.4 Random Forest (RF)

I specified a random forest model with 1,000 trees and a minimum node size of 5, using the randomForest engine for classification with variable importance enabled. I trained the model on the anxiety_train dataset and visualized the top predictors by Mean Decrease in Gini.

For model validation, I performed 20-fold cross-validation, integrating the random forest model into a workflow. The model achieved a strong ROC-AUC of 0.91, demonstrating excellent classification performance.
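For context on the Mean Decrease in Gini used below: each split's contribution is the drop in Gini impurity from the parent node to its (size-weighted) children, summed per variable over all trees. A hand computation for one hypothetical split, with made-up node counts:

```r
# Gini impurity of a binary node with class counts n0 and n1
gini <- function(n0, n1) {
  p <- c(n0, n1) / (n0 + n1)
  1 - sum(p^2)
}

# Hypothetical parent node (60 vs 40) split into two children
parent <- gini(60, 40)                        # 0.48
left   <- gini(50, 10)                        # 60 cases go left
right  <- gini(10, 30)                        # 40 cases go right
weighted_children <- (60 / 100) * left + (40 / 100) * right

decrease <- parent - weighted_children        # impurity reduction for this split
round(decrease, 3)
```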

4.2.4.1 Anxiety

library(tune)
library(parsnip)
library(recipes)
library(rsample)
library(workflows)

rf_spec<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

rf_fit<-rf_spec|>
  fit(gad2_cg_cat ~ ., data=anxiety_train)

## top variables
rf_fit|>
  extract_fit_engine()|>
  vip()

rf_fit|>
  extract_fit_engine()|>
  importance()|>
  as.data.frame()|>
  arrange(desc(MeanDecreaseGini))
                         0          1 MeanDecreaseAccuracy MeanDecreaseGini
cac11exhaustd   16.2978164 29.5984992            26.143118        26.213489
chd11dage       14.7637550 14.1101790            19.100553        19.405637
che11sleepint   13.4842426 18.0759580            19.199806        16.169002
race_recode     11.4833050 20.5639416            16.769562        12.821931
che11health     16.3274496  3.4877379            16.402138        10.891260
cac11diffphy    15.7299475 11.4245716            18.210267         8.945355
che11enrgylmt   21.8162882 11.4885835            22.691647         8.599261
cac11diffemlv   11.4355498  7.8983492            13.973246         8.192301
cca11hwoftchs    6.1413528 -4.1449183             3.731027         7.165994
cca11hwoftshp    9.1767370 -1.0531558             7.803121         6.693282
edu_recode       9.2651016 -0.7891728             8.383288         5.733373
cca11hwoftott   12.5604459 -1.2274407            10.073107         3.731203
gender_recode    4.6859000 -2.7335837             2.997123         3.001349
martstat_recode  0.6269767  2.3828227             1.646233         2.863170
ew11progneed1    3.0961447 -0.5341457             1.819173         2.023219
#20fold CV- rf
anxiety_folds<-vfold_cv(anxiety_train, v=20)
rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .)

rf_fit_cv <- rf_workflow |>
  fit_resamples(anxiety_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(rf_fit_cv) 
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.844    20 0.0155  Preprocessor1_Model1
2 brier_class binary     0.107    20 0.00851 Preprocessor1_Model1
3 roc_auc     binary     0.910    20 0.0146  Preprocessor1_Model1
#roc_auc = 0.9100
anxiety_rf_cv_preds<-collect_predictions(rf_fit_cv)
anxiety_rf_cv_preds|>
  group_by(id)|>
  roc_auc(truth = gad2_cg_cat, .pred_0)
# A tibble: 20 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.858
 2 Fold02 roc_auc binary         0.735
 3 Fold03 roc_auc binary         0.922
 4 Fold04 roc_auc binary         0.982
 5 Fold05 roc_auc binary         0.933
 6 Fold06 roc_auc binary         1    
 7 Fold07 roc_auc binary         0.946
 8 Fold08 roc_auc binary         0.95 
 9 Fold09 roc_auc binary         0.822
10 Fold10 roc_auc binary         0.933
11 Fold11 roc_auc binary         0.958
12 Fold12 roc_auc binary         0.895
13 Fold13 roc_auc binary         0.875
14 Fold14 roc_auc binary         0.857
15 Fold15 roc_auc binary         0.847
16 Fold16 roc_auc binary         1    
17 Fold17 roc_auc binary         0.876
18 Fold18 roc_auc binary         0.952
19 Fold19 roc_auc binary         0.948
20 Fold20 roc_auc binary         0.901
# Plot ROC curve
anxiety_rf_cv_preds |>
  group_by(id)|>
  roc_curve(truth = gad2_cg_cat, .pred_0) |>
  autoplot()

# Fit the random forest model on the full training data
anxiety_rf_fit <- rf_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

anxiety_rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 16.89%
Confusion matrix:
    0  1 class.error
0 295 31  0.09509202
1  45 79  0.36290323
#testing
anxiety_rf_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,  # Actual values of the outcome variable
  predict(anxiety_rf_fit, new_data = anxiety_test),  # Predicted class labels
  predict(anxiety_rf_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)
roc_auc(anxiety_rf_pred_values,
        truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.902
#roc_auc = 0.902
autoplot(roc_curve(anxiety_rf_pred_values, 
                   truth, 
                   .pred_0))

I generated cross-validation predictions and calculated the ROC-AUC for each fold, then plotted the fold-level ROC curves. Afterward, I fitted the random forest model to the full training data (anxiety_train) and predicted class labels and probabilities on the test data, achieving a ROC-AUC of 0.902. Finally, I plotted the ROC curve for the test-set predictions.
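Beyond ROC-AUC, threshold-based test-set metrics can round out the evaluation. A minimal sketch using yardstick's metric_set() on the same prediction tibble (this assumes "0" is treated as the event level, consistent with the .pred_0 column used in the roc_auc() calls above):

```r
# Sketch: threshold-based test-set metrics for the anxiety RF model,
# computed from the same prediction tibble used for roc_auc() above.
library(yardstick)

rf_class_metrics <- metric_set(accuracy, sens, spec)
rf_class_metrics(anxiety_rf_pred_values,
                 truth = truth,
                 estimate = .pred_class)
```

Sensitivity and specificity are useful here because the classes are imbalanced, so accuracy alone can be misleading.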

anxiety_rf_fit |>
  extract_fit_engine() |>
  importance()
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
cca11hwoftchs    6.443491 -3.2775802            4.2523871         7.800333
cca11hwoftshp    9.426949 -1.3433393            8.0331691         6.681948
cca11hwoftott   11.745977 -0.9118149            9.5197680         3.774009
che11enrgylmt   20.326375 11.1013733           21.2825482         8.189687
cac11diffphy    14.107192 12.1828828           17.3469957         8.211464
cac11exhaustd   15.450773 29.0439431           25.2270290        25.035438
cac11diffemlv   11.156632  8.7287032           14.1944949         8.753394
che11health     17.236296  1.7201877           16.5372772        11.333187
che11sleepint   13.073604 17.9197566           18.2365116        17.104394
ew11progneed1    1.679956 -0.4306840            0.7234876         2.096740
race_recode     10.902965 19.6620911           15.9483695        13.227973
gender_recode    5.730777 -2.6358750            4.1243945         2.951120
edu_recode      10.003250 -0.7237411            9.0343178         5.573622
chd11dage       15.254190 15.1308686           19.2096452        19.142142
martstat_recode  0.727953  1.3932472            1.2697876         2.888540
anxiety_rf_fit |>
  extract_fit_engine() |>
  vip()

#20-fold cross validation on rf
anxiety_folds<-vfold_cv(anxiety_train, v=20)

rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .)

rf_fit_cv_anxiety <- rf_workflow |>
  fit_resamples(resamples = anxiety_folds)

collect_metrics(rf_fit_cv_anxiety)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.831    20 0.0199  Preprocessor1_Model1
2 brier_class binary     0.111    20 0.00803 Preprocessor1_Model1
3 roc_auc     binary     0.897    20 0.0153  Preprocessor1_Model1
#roc_auc = 0.897

rf_wf_fit_cv_anxiety <- rf_workflow |>
  fit_resamples(
    resamples = anxiety_folds,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_wf_fit_cv_anxiety) #roc_auc = 0.900
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.835    20 0.0193  Preprocessor1_Model1
2 brier_class binary     0.110    20 0.00818 Preprocessor1_Model1
3 roc_auc     binary     0.900    20 0.0151  Preprocessor1_Model1
rf_wf_fit_cv_anxiety |>
  collect_predictions() |>
  roc_curve(gad2_cg_cat, .pred_0) |>
  autoplot()

# Collect metrics from the resampling results
rf_wf_fit_cv_anxiety_metrics <- collect_metrics(rf_wf_fit_cv_anxiety)

# If you need predictions for further analysis
rf_wf_fit_cv_anxiety_preds <- collect_predictions(rf_wf_fit_cv_anxiety)

The random forest model demonstrated strong and consistent performance, with cross-validated ROC AUC values of approximately 0.90 across repeated runs (0.897 and 0.900) and a ROC AUC of 0.902 on the held-out test set. These results indicate the model's ability to effectively distinguish between the classes, and the ROC curves further confirmed its good classification ability.
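The consistency across folds can also be quantified directly from the per-fold AUCs reported earlier; a sketch, assuming anxiety_rf_cv_preds still holds the saved cross-validation predictions:

```r
# Sketch: mean and spread of the per-fold ROC AUCs from the 20-fold CV.
library(dplyr)
library(yardstick)

anxiety_rf_cv_preds |>
  group_by(id) |>
  roc_auc(truth = gad2_cg_cat, .pred_0) |>
  ungroup() |>
  summarize(mean_auc = mean(.estimate),
            sd_auc   = sd(.estimate),
            min_auc  = min(.estimate),
            max_auc  = max(.estimate))
```

A small standard deviation across folds suggests the model's discrimination is stable rather than driven by a few favorable splits.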

4.2.4.2 Depression

rf_spec<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

depression_rf_fit<-rf_spec|>
  fit(phq2_cg_cat ~ ., data=depression_train)

## top variables
depression_rf_fit |>
  extract_fit_engine() |>
  importance() |>
  as.data.frame() |>
  arrange(desc(MeanDecreaseGini))
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
cac11exhaustd    9.558464 26.0930085            19.770143        19.057579
chd11dage       12.532606 12.5375436            16.105794        18.914430
race_recode      9.323061 19.0617623            15.055487        13.474596
che11health     21.524245  6.9597969            22.137742        13.013995
che11sleepint   11.058006  6.8646282            13.047699        10.775765
che11enrgylmt   18.884763 14.0427339            21.717931         9.372276
cac11diffemlv   15.249216  6.9186701            16.561285         7.787694
cca11hwoftchs    9.222754 -2.6027471             7.569520         7.747071
cca11hwoftshp   12.033338  0.5310029            11.581256         7.636812
cac11diffphy    13.913964  2.8148042            14.207516         7.166716
edu_recode      10.031214  1.1738487             9.920600         6.258746
cca11hwoftott   14.270968  1.3846495            13.327087         3.707555
martstat_recode  9.107561 -1.1345127             8.304831         3.295852
gender_recode    9.840679 -0.6561558             9.101497         3.150413
ew11progneed1    2.596320  1.6455922             3.134761         1.829438
#20fold CV- rf
depression_folds<-vfold_cv(depression_train, v=20)
depression_rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(phq2_cg_cat ~ .)

depression_rf_fit_cv <- depression_rf_workflow |>
  fit_resamples(depression_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(depression_rf_fit_cv) 
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.823    20 0.0118  Preprocessor1_Model1
2 brier_class binary     0.118    20 0.00621 Preprocessor1_Model1
3 roc_auc     binary     0.879    20 0.0188  Preprocessor1_Model1
#roc_auc = 0.879

I created a random forest model with 1000 trees and a minimum node size of 5 to classify depression using the phq2_cg_cat variable. After fitting the model to the training data, I extracted the importance of the top variables, using the MeanDecreaseGini measure to identify the most influential features.

For evaluation, I performed 20-fold cross-validation on the training data, which yielded a ROC AUC of 0.879, indicating good performance in classifying depression.

depression_rf_cv_preds<-collect_predictions(depression_rf_fit_cv)

depression_rf_cv_preds|>
  group_by(id)|>
  roc_auc(truth = phq2_cg_cat, .pred_0)
# A tibble: 20 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.938
 2 Fold02 roc_auc binary         0.803
 3 Fold03 roc_auc binary         0.946
 4 Fold04 roc_auc binary         0.933
 5 Fold05 roc_auc binary         0.933
 6 Fold06 roc_auc binary         0.789
 7 Fold07 roc_auc binary         0.908
 8 Fold08 roc_auc binary         0.684
 9 Fold09 roc_auc binary         0.951
10 Fold10 roc_auc binary         0.958
11 Fold11 roc_auc binary         0.756
12 Fold12 roc_auc binary         0.807
13 Fold13 roc_auc binary         0.958
14 Fold14 roc_auc binary         0.982
15 Fold15 roc_auc binary         0.778
16 Fold16 roc_auc binary         0.896
17 Fold17 roc_auc binary         0.847
18 Fold18 roc_auc binary         0.906
19 Fold19 roc_auc binary         0.847
20 Fold20 roc_auc binary         0.958
# Plot ROC curve
depression_rf_cv_preds |>
  group_by(id)|>
  roc_curve(truth = phq2_cg_cat, .pred_0) |>
  autoplot()

# Fit the random forest model on the full training data
depression_rf_fit <- rf_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

depression_rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 18.44%
Confusion matrix:
    0  1 class.error
0 302 31  0.09309309
1  52 65  0.44444444
#testing
depression_rf_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,  # Actual values of the outcome variable
  predict(depression_rf_fit, new_data = depression_test),  # Predicted class labels
  predict(depression_rf_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)
roc_auc(depression_rf_pred_values,
        truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.889
#roc_auc = 0.889

autoplot(roc_curve(depression_rf_pred_values, 
                   truth, 
                   .pred_0))

I collected the predictions from the 20-fold cross-validation of the random forest model and calculated the ROC AUC for each fold, confirming the model’s performance. The ROC curve was plotted to visualize the classification ability.

After fitting the model on the full training data, I tested it on the separate test set. The model achieved a ROC AUC of 0.889 on the test data, demonstrating strong classification performance. The ROC curve for the test predictions further supported this finding.
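The test-set ROC curve can be complemented with a confusion matrix at the default 0.5 probability threshold, which makes the class-specific error rates visible. A sketch, assuming yardstick is loaded:

```r
# Sketch: test-set confusion matrix for the depression RF model at the
# default 0.5 probability threshold.
library(yardstick)

conf_mat(depression_rf_pred_values,
         truth = truth,
         estimate = .pred_class)
```

Given the higher class-1 error seen in the out-of-bag confusion matrix above, this view helps show whether the test set exhibits the same pattern.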

depression_rf_fit |>
  extract_fit_engine() |>
  importance()
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
cca11hwoftchs   10.411322 -2.3867168             8.784269         8.576343
cca11hwoftshp   12.425121 -0.4427039            11.597314         7.216403
cca11hwoftott   12.507147  1.4734672            11.827870         3.898171
che11enrgylmt   18.800267 14.1945021            21.854479         9.026295
cac11diffphy    14.978531  4.3890169            15.629197         7.505065
cac11exhaustd   11.016451 25.5871374            20.464713        19.445610
cac11diffemlv   12.611825  8.3283489            14.987783         7.799484
che11health     20.047631  6.7396651            20.961105        12.802986
che11sleepint   10.006394  8.6619340            12.371478        10.888827
ew11progneed1    2.675039 -1.5364982             1.055941         2.001016
race_recode      9.882035 20.1273871            15.792884        13.597460
gender_recode   10.043201 -3.1795128             8.383588         3.184998
edu_recode       8.305797  0.5273818             8.135316         6.059345
chd11dage       13.645514 12.5004959            16.969906        18.977284
martstat_recode  7.598995  1.0108678             7.532664         3.175052
depression_rf_fit |>
  extract_fit_engine() |>
  vip()

rf_spec_depression<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

rf_fit_depression<-rf_spec_depression|>
  fit(phq2_cg_cat ~ ., data=depression_train)


#20-fold cross validation on rf
depression_folds<-vfold_cv(depression_train, v=20)

rf_workflow_depression <- workflow() |>
  add_model(rf_spec_depression) |>
  add_formula(phq2_cg_cat ~ .)

rf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(resamples = depression_folds)

collect_metrics(rf_fit_cv_depression)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.805    20 0.0198  Preprocessor1_Model1
2 brier_class binary     0.121    20 0.00803 Preprocessor1_Model1
3 roc_auc     binary     0.890    20 0.0181  Preprocessor1_Model1
#roc_auc = 0.890

rf_wf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(
    resamples = depression_folds,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_wf_fit_cv_depression) #roc_auc = 0.892
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.814    20 0.0202  Preprocessor1_Model1
2 brier_class binary     0.120    20 0.00783 Preprocessor1_Model1
3 roc_auc     binary     0.892    20 0.0183  Preprocessor1_Model1
rf_wf_fit_cv_depression |>
  collect_predictions() |>
  roc_curve(phq2_cg_cat, .pred_0) |>
  autoplot()

# Collect metrics from the resampling results
rf_wf_fit_cv_depression_metrics <- collect_metrics(rf_wf_fit_cv_depression)

# If you need predictions for further analysis
rf_wf_fit_cv_depression_preds <- collect_predictions(rf_wf_fit_cv_depression)

The cross-validated ROC AUC was approximately 0.89 in both runs: 0.890 initially and 0.892 after adding the save_pred = TRUE control option, which stores the predictions for each fold. Both values indicate good model performance.
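The small run-to-run differences in AUC arise because both the random-forest fitting and the vfold_cv() fold assignment are stochastic. Setting a seed before each stochastic step makes the results reproducible; a sketch (the seed value is arbitrary):

```r
# Sketch: seeding the stochastic steps so repeated runs of the
# cross-validation give identical metrics. The seed value is arbitrary.
set.seed(600)
depression_folds <- vfold_cv(depression_train, v = 20)

set.seed(600)
rf_wf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(
    resamples = depression_folds,
    control = control_resamples(save_pred = TRUE)
  )
```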

4.2.4.3 Graph

Anxiety

# ROC curve for GLM training-set predictions
anxiety_roc_glm_training <- anxiety_glm_pred_values |> roc_curve(truth, .pred_0)
# ROC curve for GLM test-set predictions
anxiety_roc_glm_cv <- anxiety.lr.pred.values.test |> roc_curve(truth, .pred_0)
# ROC curve for RF test-set predictions
anxiety_roc_rf_training <- anxiety_rf_pred_values|> roc_curve(truth, .pred_0)
# ROC curve for RF 20-fold cross-validation predictions
anxiety_roc_rf_cv <- rf_wf_fit_cv_anxiety_preds |>
  roc_curve(truth = gad2_cg_cat, .pred_0)


ggplot() +
  geom_path(data = anxiety_roc_glm_training, aes(x = 1 - specificity, y = sensitivity),
            color = "blue", linetype = "solid", linewidth = 1) +
  geom_path(data = anxiety_roc_glm_cv, aes(x = 1 - specificity, y = sensitivity),
            color = "green", linetype = "dashed", linewidth = 1) +
  geom_path(data = anxiety_roc_rf_training, aes(x = 1 - specificity, y = sensitivity),
            color = "red", linetype = "solid", linewidth = 1) +
  geom_path(data = anxiety_roc_rf_cv, aes(x = 1 - specificity, y = sensitivity),
            color = "purple", linetype = "dashed", linewidth = 1) +
  coord_equal() +
  theme_bw() +
  labs(title = "Anxiety ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)") +
  annotate("text", x = 0.7, y = 0.20, label = "Logistic Regression (Training)",
           color = "blue", size = 2) +
  annotate("text", x = 0.7, y = 0.15, label = "Logistic Regression (Test)",
           color = "green", size = 2) +
  annotate("text", x = 0.7, y = 0.10, label = "Random Forest (Test)",
           color = "red", size = 2) +
  annotate("text", x = 0.7, y = 0.05, label = "Random Forest (20-fold CV)",
           color = "purple", size = 2)

Depression

library(yardstick)

# ROC curve for GLM training-set predictions
depression_roc_glm_training <- depression_glm_pred_values |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for GLM test-set predictions
depression_roc_glm_cv <- depression.lr.pred.values.test |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for RF test-set predictions
depression_roc_rf_training <- depression_rf_pred_values |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for RF 20-fold cross-validation predictions
depression_roc_rf_cv <- depression_rf_cv_preds |>
  roc_curve(truth = phq2_cg_cat, .pred_0)


ggplot() +
  geom_path(data = depression_roc_glm_training, aes(x = 1 - specificity, y = sensitivity),
            color = "blue", linetype = "solid", linewidth = 1) +
  geom_path(data = depression_roc_glm_cv, aes(x = 1 - specificity, y = sensitivity),
            color = "green", linetype = "dashed", linewidth = 1) +
  geom_path(data = depression_roc_rf_training, aes(x = 1 - specificity, y = sensitivity),
            color = "red", linetype = "solid", linewidth = 1) +
  geom_path(data = depression_roc_rf_cv, aes(x = 1 - specificity, y = sensitivity),
            color = "purple", linetype = "dashed", linewidth = 1) +
  coord_equal() +
  theme_bw() +
  labs(title = "Depression ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)") +
  annotate("text", x = 0.5, y = 0.20, label = "Logistic Regression (Training)",
           color = "blue", size = 2) +
  annotate("text", x = 0.5, y = 0.15, label = "Logistic Regression (Test)",
           color = "green", size = 2) +
  annotate("text", x = 0.5, y = 0.10, label = "Random Forest (Test)",
           color = "red", size = 2) +
  annotate("text", x = 0.5, y = 0.05, label = "Random Forest (20-fold CV)",
           color = "purple", size = 2)

Limitations:
Despite the robust analysis and promising results, there are several limitations to consider. First, the sample size for both the training and testing sets may not fully capture the diversity of the target population, potentially limiting the generalizability of the findings. Second, while cross-validation helps to reduce overfitting, it does not account for all potential biases or variations within different subgroups of the data. Third, missing data, if not adequately handled, could have introduced bias or reduced the accuracy of model predictions. Furthermore, the models used are based on observed data and may not fully capture complex relationships or interactions that could emerge with a broader set of variables or external factors. Lastly, the reliance on specific features may limit the scope of the model’s performance in more dynamic or varied real-world scenarios.

4.3 Conclusion

In conclusion, I performed 20-fold cross-validation on both the anxiety and depression datasets using logistic regression (GLM) and random forest models. Both outcomes achieved high AUC values, indicating strong model performance, and the variable importance analysis highlighted the key features driving the predictions. The ROC curves demonstrated good classification accuracy for both the anxiety (cross-validated AUC = 0.91) and depression (cross-validated AUC = 0.89) random forest models. These results suggest that random forest models are effective in predicting caregiver anxiety and depression from the available features.